PISCES MP - Adaptation of a Dusty Deck for Multiprocessing

I. Introduction

Scalable multiprocessors offer high performance at relatively low
cost. Unfortunately, the programming model required to take advantage
of these architectures is a radical departure from traditional
paradigms. Most users are unwilling to discard the knowledge and
expertise captured by existing dusty-deck programs in exchange for a
faster yet unproved and unfamiliar parallel code. To explore the
potential for providing vastly improved dusty-deck performance while
preserving the knowledge implicit in the program, we have parallelized
the device simulator PISCES [1] on an Intel iPSC/860™ hypercube.

Section II gives a brief overview of PISCES. Section III describes the
methods used to transform PISCES into a parallel code. A demonstration
of the computational power of the new parallel device solver is
presented in Section IV. Improvements to the linear solver are
discussed in Section V. Finally, conclusions are given in Section VI.

II. Overview of PISCES

PISCES is a two-dimensional device simulator consisting of
approximately 40,000 lines of FORTRAN-77. Code development has been
ongoing throughout the last ten years, involving several generations
of graduate students and researchers. Although the program structure
is rather inelegant, great care has been taken to validate the code as
well as to improve and calibrate the physical models. It solves
Poisson's equation and the continuity equations below:


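$$\nabla \cdot (\varepsilon \nabla \psi) = -q\,(p - n + N_D^{+} - N_A^{-})$$

$$\frac{\partial n}{\partial t} = \frac{1}{q}\,\nabla \cdot \mathbf{J}_n - U_n$$

$$\frac{\partial p}{\partial t} = -\frac{1}{q}\,\nabla \cdot \mathbf{J}_p - U_p$$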


The equations are discretized using an irregular triangular grid and
are solved using either Newton or Gummel nonlinear schemes. A large
number of physical models are supported. The sparse systems of linear
equations arising from these methods are solved using an optimized
sparse direct solver as described in [2]. Figure 1 shows the major
code components. Lucas observed in [3] that for even small simulation
grids, PISCES spends between 77% and 96% of its runtime solving the
coupled nonlinear device equations. The nonlinear solver repeatedly
forms element matrices, assembles a global matrix, solves the
resulting sparse system, and updates the nonlinear solution. Recent
experience shows nonlinear solution times grow to be more than 99% of
the runtime for moderate to large grid sizes. The remaining fraction
of time is spent in the user interface (UI) parsing user input,
performing I/O, and generating the grid.
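The runtime fractions quoted above bound the speedup that parallelizing
only the nonlinear solver can deliver. As a rough illustration, the
sketch below applies Amdahl's law, which is not part of the original
analysis. It assumes the nonlinear solver scales perfectly and that
the remaining work (the UI) stays serial; the function name
`amdahl_speedup` is hypothetical, and 0.99 stands in for the "more
than 99%" figure.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup when only `parallel_fraction` of the runtime is parallelized.

    Assumes perfect scaling of the parallel part and no communication cost,
    so the result is an upper bound, not a prediction.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Solver fractions quoted in the text: 77% and 96% (small grids),
    # and 99% as a stand-in for "more than 99%" (moderate to large grids).
    for fraction in (0.77, 0.96, 0.99):
        for processors in (4, 8, 16, 32):
            print(f"solver fraction {fraction:.2f}, {processors:2d} CPUs: "
                  f"speedup <= {amdahl_speedup(fraction, processors):.2f}")
```

Under these assumptions a 77% solver fraction caps the speedup below
1/0.23 (about 4.3) regardless of processor count, while a 99% fraction
allows up to 100, which is why larger grids stand to benefit most.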


III. Parallelization of PISCES

Typical PISCES simulations require several hours on moderate grid
sizes and days on large grids. Clearly, significant performance gains
would be welcomed by users. Restructuring the nonlinear solver and all
of its requisite routines to run in parallel would breathe new life
into the simulator. However, the UI is inherently serial and must be
treated differently. In order to accommodate this dichotomy, we split
the code into two programs. Figure 2 shows the structure of PISCES MP.
The bulk of PISCES MP runs on the hypercube, including all code for
nonlinear solution, model evaluation, matrix formation, matrix
assembly, and linear solution. Although we left the majority of PISCES
code untouched, many changes were necessary. Fortunately, changes
rarely pervaded the entire code. For instance, we were forced to add
data structures to map each processor's local domain into the global
simulation grid and to determine each processor's responsibility for
shared portions in each domain. We were also forced to modify those
physical models and assembly routines that relied on non-local
information. For example, all grid points attached to an electrode
must be given a consistent potential value. If these grid points are
distributed across multiple processors, the processors must
communicate to determine the proper value. Finally, we replaced the
linear
direct method, each processor directly eliminates all local equations
and updates a dense block corresponding to the shared equations.
Rather than solve the dense shared block directly, we use the
preconditioned generalized minimal residual (GMRES) algorithm [5].
GMRES requires less global data traffic than a direct method. Table 2
compares the linear solution times on the 9200 grid point example
described earlier. The solution times for a single linear solution are
given. This simulation required in excess of 120 linear solutions. Not
surprisingly, the fully direct method is faster for a small number of
processors due to the large amount of local computation coupled with
the small amount of necessary data transfer. As expected, for larger
numbers of processors the hybrid method outperforms the fully direct
method by reducing the amount of shared data transfer. This allows for
the exploitation of greater concurrency and results in faster overall
solution times.
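The hybrid scheme described above can be sketched on a toy problem. The
code below is an illustrative assumption, not the PISCES MP
implementation: a serial loop stands in for the processors, dense
Gaussian elimination stands in for the local sparse direct solver,
GMRES is run without the preconditioner the text mentions, and all
names (`solve_dense`, `gmres`, `hybrid_solve`) and the matrices are
hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_dense(A, B):
    """Solve A X = B (B has one or more columns) by Gaussian elimination
    with partial pivoting. Stands in for the local sparse direct solver."""
    n, m = len(A), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + m):
                M[r][c] -= f * M[k][c]
    X = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(m):
            s = M[i][n + j] - sum(M[i][c] * X[c][j] for c in range(i + 1, n))
            X[i][j] = s / M[i][i]
    return X


def gmres(A, b, tol=1e-12):
    """Unrestarted, unpreconditioned GMRES for a small dense system A x = b."""
    n, beta = len(b), math.sqrt(dot(b, b))
    if beta == 0.0:
        return [0.0] * n
    V, R, cs, sn, g = [[bi / beta for bi in b]], [], [], [], [beta]
    for k in range(n):
        w = [dot(row, V[k]) for row in A]
        h = []
        for v in V:  # modified Gram-Schmidt (Arnoldi step)
            h.append(dot(w, v))
            w = [wi - h[-1] * vi for wi, vi in zip(w, v)]
        h_next = math.sqrt(dot(w, w))
        for i in range(k):  # apply earlier Givens rotations to the new column
            h[i], h[i + 1] = (cs[i] * h[i] + sn[i] * h[i + 1],
                              -sn[i] * h[i] + cs[i] * h[i + 1])
        d = math.hypot(h[k], h_next)
        cs.append(h[k] / d)
        sn.append(h_next / d)
        h[k] = d
        g.append(-sn[k] * g[k])
        g[k] = cs[k] * g[k]
        R.append(h)
        if abs(g[k + 1]) <= tol * beta:  # residual norm estimate is small enough
            break
        V.append([wi / h_next for wi in w])
    m = len(R)
    y = [0.0] * m
    for i in reversed(range(m)):  # back substitution on the triangular factor
        y[i] = (g[i] - sum(R[j][i] * y[j] for j in range(i + 1, m))) / R[i][i]
    return [sum(y[j] * V[j][i] for j in range(m)) for i in range(n)]


def hybrid_solve(subdomains, A_ss, b_s):
    """subdomains: list of (A_ii, A_is, A_si, b_i), one tuple per processor."""
    ns = len(b_s)
    S, g, local = [row[:] for row in A_ss], b_s[:], []
    for A_ii, A_is, A_si, b_i in subdomains:
        # Local direct elimination: X = A_ii^{-1} [A_is | b_i].
        X = solve_dense(A_ii, [list(r) + [bi] for r, bi in zip(A_is, b_i)])
        local.append(X)
        for r in range(ns):  # update the dense shared block and right-hand side
            for c in range(ns + 1):
                upd = sum(A_si[r][k] * X[k][c] for k in range(len(X)))
                if c < ns:
                    S[r][c] -= upd
                else:
                    g[r] -= upd
    x_s = gmres(S, g)  # iterative solve on the shared equations only
    x_loc = [[row[ns] - dot(row[:ns], x_s) for row in X] for X in local]
    return x_loc, x_s


# Toy problem: two subdomains, two interior unknowns each, two shared unknowns.
A_ii = [[4.0, -1.0], [-1.0, 4.0]]
coupling = [[-1.0, 0.0], [0.0, -1.0]]
subdomains = [(A_ii, coupling, coupling, [1.0, 2.0]),
              (A_ii, coupling, coupling, [3.0, 4.0])]
A_ss, b_s = [[4.0, -1.0], [-1.0, 4.0]], [5.0, 6.0]
x_loc, x_s = hybrid_solve(subdomains, A_ss, b_s)
```

In this sketch only the shared block `S` and its right-hand side need
contributions from every subdomain; the interior eliminations and the
final back-substitution for `x_loc` use purely local data, which is the
property the hybrid method relies on to limit global data traffic.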


                      Hybrid      Direct
Computational Unit    Time (s)    Time (s)
iPSC/860  4 CPU       11.023      9.604
iPSC/860  8 CPU        6.821      5.618
iPSC/860 16 CPU        3.942      4.819
iPSC/860 32 CPU        3.768      6.951

Table 2
Comparison of linear solution times on the 9200 grid point example
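As a quick check on the Table 2 timings, the short sketch below
computes each solver's speedup relative to its own 4-processor time
and finds the processor count at which the hybrid solver first becomes
faster. The timings are transcribed from Table 2; the helper names and
the choice of the 4-CPU run as baseline are assumptions made for
illustration.

```python
# Single linear-solve times in seconds, transcribed from Table 2.
TIMES = {
    4: {"hybrid": 11.023, "direct": 9.604},
    8: {"hybrid": 6.821, "direct": 5.618},
    16: {"hybrid": 3.942, "direct": 4.819},
    32: {"hybrid": 3.768, "direct": 6.951},
}


def relative_speedup(method, processors, baseline=4):
    """Speedup of `method` relative to its own time on `baseline` processors."""
    return TIMES[baseline][method] / TIMES[processors][method]


def crossover():
    """Smallest processor count at which the hybrid solver is faster."""
    return min(p for p, t in TIMES.items() if t["hybrid"] < t["direct"])


if __name__ == "__main__":
    for p in sorted(TIMES):
        print(f"{p:2d} CPUs: hybrid x{relative_speedup('hybrid', p):.2f}, "
              f"direct x{relative_speedup('direct', p):.2f}")
    print("hybrid first wins at", crossover(), "CPUs")
```

On these numbers the direct solver slows down between 16 and 32
processors (4.819 s to 6.951 s) while the hybrid solver keeps
improving, consistent with the communication cost attributed to the
direct method in the text.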


VI. Conclusions

In this paper, we have described the parallelization of PISCES. We
have retained the valuable expertise captured in the long-term
development of the program. Our initial results show significant
performance gains. In fact, the program not only runs existing
simulations faster but also provides the capability of solving vastly
larger problems than originally feasible. We have also addressed the
communication bottleneck created by the direct solver when using large
numbers of processors. We have implemented a hybrid solver that
produces greater parallel efficiency in these cases.